We live in a time when video games are extremely popular. The global video game market continues to grow year-on-year, and the industry is now valued at over $100 billion worldwide. With technology continuously pushing the boundaries, video games have only become more popular and more polished: gameplay mechanics, cutting-edge graphics, and intricate storylines make today’s games more immersive than ever before. We chose this dataset to gain insights into the popularity of upcoming games.
‘Popular’ is our class label: we will use the Global_Sales attribute to predict whether a game will sell 1,000,000 copies or more globally.
Our data mining task is to predict the popularity of upcoming games using regression.
The dataset, provided by vgchartz.com, supplies us with a valuable resource for exploring the platforms and genres of the top 16,598 global video games. Through it, we can analyze the most popular platforms and genres influencing global sales, and detect how regional sales affect global sales.
Our goal from studying this dataset is to utilize regression techniques on the input data to make predictions about the popularity of upcoming games.
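A minimal sketch of deriving the ‘Popular’ class label from Global_Sales. Note the assumption here (not stated in the dataset itself) that Global_Sales is recorded in millions of units, so a value of 1.0 or more corresponds to at least one million copies sold worldwide; the toy values are illustrative only.

```r
# Sketch: derive the 'Popular' class label from Global_Sales.
# Assumption: Global_Sales is in millions of units, so >= 1.0 means
# one million or more copies sold globally.
global_sales <- c(82.74, 0.53, 1.00, 0.07)  # toy values
popular <- ifelse(global_sales >= 1, "Yes", "No")
popular
```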
| Attributes name | Description | Data type |
|---|---|---|
| Rank | Ranking of the game based on global sales. | Numeric |
| Name | Name of the game. | Nominal |
| Platform | Platform the game was released on. | Nominal |
| Year | Year the game was released. | Ordinal |
| Genre | Genre of the game | Nominal |
| Publisher | Publisher of the game. | Nominal |
| NA_Sales | Sales of the game in North America | Numeric (ratio-scaled) |
| EU_Sales | Sales of the game in Europe | Numeric (ratio-scaled) |
| JP_Sales | Sales of the game in Japan | Numeric (ratio-scaled) |
| Other_Sales | Sales of the game in other regions | Numeric (ratio-scaled) |
| Global_Sales | Total sales of the game worldwide | Numeric (ratio-scaled) |
library(outliers)
library(dplyr)
library(Hmisc)
library(ggplot2)
library(mlbench)
library(caret)
options(max.print=9999999)
dataset=read.csv("Dataset/vgsales.csv")
Checking the number of rows and columns, the dimensionality, and the column names:
nrow(dataset)
[1] 16598
ncol(dataset)
[1] 11
dim(dataset)
[1] 16598 11
names(dataset)
[1] "Rank" "Name" "Platform" "Year" "Genre" "Publisher" "NA_Sales" "EU_Sales" "JP_Sales"
[10] "Other_Sales" "Global_Sales"
Dataset structure:
str(dataset)
'data.frame': 16598 obs. of 11 variables:
$ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
$ Platform : chr "Wii" "NES" "Wii" "Wii" ...
$ Year : chr "2006" "1985" "2008" "2009" ...
$ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
$ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
$ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
$ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
$ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
$ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
$ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
Sample of the raw dataset (first 10 rows):
head(dataset, 10)
Sample of the raw dataset (last 10 rows):
tail(dataset, 10)
Summary statistics (the five-number summary plus the mean) of each attribute in our dataset:
summary(dataset)
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales
Min. : 1 Length:16598 Length:16598 Length:16598 Length:16598 Length:16598 Min. : 0.0000 Min. : 0.0000
1st Qu.: 4151 Class :character Class :character Class :character Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 8300 Mode :character Mode :character Mode :character Mode :character Mode :character Median : 0.0800 Median : 0.0200
Mean : 8301 Mean : 0.2647 Mean : 0.1467
3rd Qu.:12450 3rd Qu.: 0.2400 3rd Qu.: 0.1100
Max. :16600 Max. :41.4900 Max. :29.0200
JP_Sales Other_Sales Global_Sales
Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
Median : 0.00000 Median : 0.01000 Median : 0.1700
Mean : 0.07778 Mean : 0.04806 Mean : 0.5374
3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4700
Max. :10.22000 Max. :10.57000 Max. :82.7400
variance of numeric data:
var(dataset$NA_Sales)
[1] 0.6669712
var(dataset$EU_Sales)
[1] 0.2553799
var(dataset$JP_Sales)
[1] 0.0956607
var(dataset$Other_Sales)
[1] 0.03556559
var(dataset$Global_Sales)
[1] 2.418112
dataset2 <- dataset %>% sample_n(50)
tab <- dataset2$Platform %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
pie(tab, labels=txt , main = "Pie chart of Platform")
This pie chart illustrates the platforms of a random sample of 50 games. We notice from it that releasing a game for PS users may increase the game’s popularity, since PlayStation is the most common platform among gamers in this sample.
# coloring barplot and adding text
tab<-dataset$Genre %>% table()
percentages<-tab %>% prop.table() %>% round(3)*100
txt<-paste0(names(tab), '\n',percentages,'%')
bb <- dataset$Genre %>% table() %>% barplot(axisnames=F, main = "Barplot for Popular genres ",ylab='count',col=c('pink','blue','lightblue','green','lightgreen','red','orange','red','grey','yellow','azure','olivedrab'))
text(bb,tab/2,labels=txt,cex=1.5)
This barplot illustrates the popularity of video game genres. In terms of genre, action games are the most common, followed by sports and miscellaneous titles. It is safe to assume that so many games of these genres exist due to their popularity and sales.
boxplot(dataset$NA_Sales, main = "BoxPlot for NA_Sales")
The boxplot of the NA_Sales attribute (sales of the game in North America) indicates that most values are close to each other, and there are many outliers, since the dataset covers all North American video game sales.
boxplot(dataset$EU_Sales, main = "BoxPlot for EU_Sales")
The boxplot of the EU_Sales attribute (sales of the game in Europe) shows the same pattern: tightly packed values with many outliers, since the dataset covers all European video game sales.
boxplot(dataset$JP_Sales, main = "BoxPlot for JP_Sales")
The boxplot of the JP_Sales attribute (sales of the game in Japan) likewise shows tightly packed values with many outliers, since the dataset covers all Japanese video game sales.
boxplot(dataset$Other_Sales, main = "BoxPlot for Other_Sales")
The boxplot of the Other_Sales attribute indicates that the values are close to each other, with many outliers.
boxplot(dataset$Global_Sales, main = "BoxPlot for Global_Sales")
The boxplot of the Global_Sales attribute indicates that the values are close to each other, with many outliers, since the dataset represents worldwide video game sales.
qplot(data = dataset, x=Global_Sales,y=Genre,fill=I("yellow"),width=0.5 ,geom = "boxplot" , main = "BoxPlots for genre and Global_Sales")
In the boxplot we can see that all genres have Global_Sales values close to each other, but we notice an outlier with Global_Sales above 80, which is a sports game (Wii Sports).
dataset$Year %>% table() %>% barplot( main = "Barplot for year")
The barplot of Year illustrates that the number of released video games was low from 1980 until about 2000, after which the number of games per year increased to more than 1,200 through 2012.
pairs(~NA_Sales + EU_Sales + JP_Sales + Other_Sales + Global_Sales, data = dataset,
main = "Sales Scatterplot")
We used a scatterplot matrix to determine the type of correlation between the sales attributes; we can see that most pairs have a positive correlation with each other.
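The visual impression from the scatterplot matrix can be backed with a correlation coefficient. A toy sketch (on the real data this would be `cor(dataset[, 7:11])`; the vectors below are illustrative only):

```r
# Toy sketch: quantify the positive correlation the scatterplot suggests.
na_sales <- c(1.0, 2.0, 3.0, 4.0, 5.0)
eu_sales <- c(0.5, 1.1, 1.4, 2.2, 2.4)
r <- cor(na_sales, eu_sales)
r  # close to 1, i.e. a strong positive correlation
```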
dataset$Rank=as.character(dataset$Rank)
We transformed Rank from numeric to character because we will treat it as ordinal data rather than a quantity.
We checked for null values to see how many we have, so we can decide how to deal with them.
sum(is.na(dataset$Rank))
[1] 0
NullRank<-dataset[dataset$Rank=="N/A",]
NullRank
Checking for nulls in Rank (there are none).
sum(is.na(dataset$Name))
[1] 0
NullName<-dataset[dataset$Name=="N/A",]
NullName
Checking for nulls in Name (there are none).
sum(is.na(dataset$Platform))
[1] 0
NullPlatform<-dataset[dataset$Platform=="N/A",]
Checking for nulls in Platform (there are none).
sum(is.na(dataset$Year))
[1] 0
NullYear<-dataset[dataset$Year=="N/A",]
NullYear
Checking for nulls in Year: we won’t delete the "N/A" entries; we leave them as a global constant because we still want the sales data from those rows.
sum(is.na(dataset$Genre))
[1] 0
NullGenre<-dataset[dataset$Genre=="N/A",]
NullGenre
Checking for nulls in Genre (there are none).
sum(is.na(dataset$Publisher))
[1] 0
NullPublisher<-dataset[dataset$Publisher=="N/A",]
NullPublisher
Checking for nulls in Publisher: we won’t delete the "N/A" entries; we leave them as a global constant because we still want the sales data from those rows.
sum(is.na(dataset$NA_Sales))
[1] 0
NullNA_Sales<-dataset[dataset$NA_Sales=="N/A",]
NullNA_Sales
Checking for nulls in NA_Sales (there are none).
sum(is.na(dataset$EU_Sales))
[1] 0
NullEU_Sales<-dataset[dataset$EU_Sales=="N/A",]
NullEU_Sales
Checking for nulls in EU_Sales (there are none).
sum(is.na(dataset$JP_Sales))
[1] 0
NullJP_Sales<-dataset[dataset$JP_Sales=="N/A",]
NullJP_Sales
Checking for nulls in JP_Sales (there are none).
sum(is.na(dataset$Other_Sales))
[1] 0
NullOther_Sales<-dataset[dataset$Other_Sales=="N/A",]
There are no null values in Other_Sales.
sum(is.na(dataset$Global_Sales))
[1] 0
NullGlobal_Sales<-dataset[dataset$Global_Sales=="N/A",]
There are no null values in Global_Sales.
We will encode our categorical data since most machine learning algorithms work with numbers rather than text.
dataset$Platform=factor(dataset$Platform,levels=c("2600","3DO","3DS","DC","DS","GB","GBA","GC","GEN","GG","N64","NES","NG","PC","PCFX","PS","PS2","PS3","PS4","PSP","PSV","SAT","SCD","SNES","TG16","Wii","WiiU","WS","X360","XB","XOne"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31))
This column is encoded to facilitate our data mining task.
dataset$Genre=factor(dataset$Genre,levels=c("Action","Adventure","Fighting","Platform","Puzzle","Racing","Role-Playing","Shooter","Simulation","Sports","Strategy","Misc"),labels=c(1,2,3,4,5,6,7,8,9,10,11,12))
Since most machine learning algorithms work with numbers and not with text or categorical variables, this column will be encoded to facilitate our data mining task.
Analyses and statistical models can be ruined by outliers, making it difficult to detect a true effect. Therefore, we are checking for them and removing them if we find any.
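As a sketch of what `outlier()` from the outliers package computes below: it flags the single value with the largest absolute distance from the sample mean. A base-R equivalent (the function name `most_extreme` is ours, not part of any package):

```r
# Sketch of what outliers::outlier() does: return the single value
# with the largest absolute distance from the sample mean.
most_extreme <- function(x) x[which.max(abs(x - mean(x)))]
most_extreme(c(0.1, 0.2, 0.3, 41.49))
```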
outlier of NA_Sales
OutNA_Sales = outlier(dataset$NA_Sales, logical =TRUE)
sum(OutNA_Sales)
[1] 1
Find_outlier = which(OutNA_Sales ==TRUE, arr.ind = TRUE)
outlier of EU_Sales
OutEU_Sales = outlier(dataset$EU_Sales, logical =TRUE)
sum(OutEU_Sales)
[1] 1
Find_outlier = which(OutEU_Sales ==TRUE, arr.ind = TRUE)
outlier of JP_Sales
OutJP_Sales = outlier(dataset$JP_Sales, logical =TRUE)
sum(OutJP_Sales)
[1] 1
Find_outlier = which(OutJP_Sales ==TRUE, arr.ind = TRUE)
outlier of other_sales
OutOS=outlier(dataset$Other_Sales, logical=TRUE)
sum(OutOS)
[1] 1
Find_outlier=which(OutOS==TRUE, arr.ind=TRUE)
outlier of Global_sales
OutGS=outlier(dataset$Global_Sales, logical=TRUE)
sum(OutGS)
[1] 1
Find_outlier=which(OutGS==TRUE, arr.ind=TRUE)
# Note: Find_outlier is overwritten at each step above, so only the
# Global_Sales outlier row is removed here.
dataset= dataset[-Find_outlier,]
The normalization of data will improve the performance of many machine learning algorithms by accounting for differences in the scale of the input features.
Dataset before normalization:
datsetWithoutNormalization<-dataset
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataset$NA_Sales<-normalize(datsetWithoutNormalization$NA_Sales)
dataset$EU_Sales<-normalize(datsetWithoutNormalization$EU_Sales)
dataset$JP_Sales<-normalize(datsetWithoutNormalization$JP_Sales)
dataset$Other_Sales<-normalize(datsetWithoutNormalization$Other_Sales)
dataset$Global_Sales<-normalize(datsetWithoutNormalization$Global_Sales)
We chose min-max normalization instead of z-score normalization because min-max transform the data into a specific range, which enhances its suitability for visualization and comparison. Additionally, it simplifies the process of assessing attribute importance and their contributions to the model.
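A worked example of the min-max transform defined above, restated here so the snippet is self-contained; it maps any numeric vector onto [0, 1]:

```r
# Worked example of min-max normalization: min maps to 0, max to 1.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(0, 5, 10, 20))
```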
Our class label (Popular) is derived from Global_Sales. Because we have sales for multiple regions, we chose to evaluate each region’s sales by its importance to the Global_Sales column; those that are less important will be deleted from the dataset.
Use filterVarImp to score attribute importance (for a numeric outcome, caret fits a loess smoother and uses an R²-based statistic rather than ROC area):
roc_imp <- filterVarImp(x = dataset[,7:10], y = dataset$Global_Sales)
Sort the score in decreasing order
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
We will remove JP_Sales because it is of low importance to our class label (Global_Sales).
dataset<- dataset[,-9]
#Discretization
dataBeforDiscertize=(dataset[,7:10])
library("arules")
dataAfterDiscertize=discretizeDF(dataBeforDiscertize)
Warning: The calculated breaks are: 0, 0, 0.0046583850931677, 1
Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.Warning: The calculated breaks are: 0, 0, 0.00189214758751183, 1
Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
unique(dataAfterDiscertize[,1])
[1] [0.0055,1] [0.000688,0.0055) [0,0.000688)
Levels: [0,0.000688) [0.000688,0.0055) [0.0055,1]
unique(dataAfterDiscertize[,2])
[1] [0.00466,1] [0,0.00466)
Levels: [0,0.00466) [0.00466,1]
unique(dataAfterDiscertize[,3])
[1] [0.00189,1] [0,0.00189)
Levels: [0,0.00189) [0.00189,1]
unique(dataAfterDiscertize[,4])
[1] [0.00795,1] [0.00199,0.00795) [0,0.00199)
Levels: [0,0.00199) [0.00199,0.00795) [0.00795,1]
levels(dataAfterDiscertize$NA_Sales)<-c("low","medium","high")
levels(dataAfterDiscertize$EU_Sales)<-c("low","high")
levels(dataAfterDiscertize$Other_Sales)<-c("low","high")
levels(dataAfterDiscertize$Global_Sales)<-c("low","medium","high")
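A base-R sketch of the equal-frequency discretization that `discretizeDF` applies by default (per the arules documentation): breaks placed at quantiles give bins with roughly equal counts, which is also why duplicate breaks on zero-heavy columns triggered the warnings above. The toy vector is illustrative only:

```r
# Sketch of equal-frequency discretization using base R: quantile
# breaks produce three bins of roughly equal size.
x <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.40, 0.80, 1.60, 3.20)
bins <- cut(x, breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
            labels = c("low", "medium", "high"), include.lowest = TRUE)
table(bins)
```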
#Balancing
library(groupdata2)
dataset<-downsample(dataset,cat_col="Global_Sales")
print(dataset)
We performed balancing and discretization because we noticed from the graphs that our dataset is imbalanced.
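A sketch of what `groupdata2::downsample` does, expressed in base R on toy data (the column names here mirror ours but the values are made up): every class is sampled down to the size of the smallest class, producing a balanced dataset.

```r
# Sketch of downsampling: sample every class down to the smallest
# class size so all classes end up equally represented.
set.seed(42)
labels <- c(rep("low", 50), rep("medium", 20), rep("high", 10))
df <- data.frame(id = seq_along(labels), Global_Sales = labels)
min_n <- min(table(df$Global_Sales))
balanced <- do.call(rbind, lapply(split(df, df$Global_Sales),
                                  function(g) g[sample(nrow(g), min_n), ]))
table(balanced$Global_Sales)  # min_n rows per class
```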
The goal of classification is to build a model or algorithm that can generalize patterns and relationships observed in the training data to make accurate predictions on unseen data. The model learns from the labeled examples in the training set, where each example consists of a set of input features and a corresponding known class label.
By using Information Gain to select attributes for splitting, the decision tree algorithm aims to create a tree that maximizes the information provided by the attributes about the class labels, leading to effective classification.
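The Information Gain criterion described above can be sketched in a few lines of R. This is a toy illustration on made-up labels, not our real data: the gain of a split is the entropy of the labels minus the size-weighted entropy of each subgroup.

```r
# Minimal sketch of Information Gain for one candidate split.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}
info_gain <- function(y, attribute) {
  groups <- split(y, attribute)
  weighted <- sum(sapply(groups,
                         function(g) length(g) / length(y) * entropy(g)))
  entropy(y) - weighted
}
y <- c("pop", "pop", "flop", "flop")
info_gain(y, attribute = c("A", "A", "B", "B"))  # a perfect split
```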
Clustering is a technique used to group similar data points together based on their inherent characteristics or similarities. Our goal in clustering is to identify patterns, structures, or relationships within a dataset without any prior knowledge of the groups or classes that may exist.
Before starting the clustering process we need to remove the class label, since clustering is unsupervised learning. Before removing it, we store it in a variable in case of further need (we need it to compute BCubed precision and recall). Then we transform each factor column to numeric, because the k-means algorithm computes distances and means and therefore requires numeric input rather than factor variables.
# Store the class label in a variable for later use (we need it to compute BCubed precision and recall)
classLabel<-dataset$Global_Sales
# Remove the class label before the clustering process
dataset<- dataset[,-10]
# We removed columns that are not relevant to the clustering process and can distort the result
datasetClustering <- dataset[, setdiff(3:9, c(4, 6))]
View(datasetClustering)
## Converting factors to numeric to apply the k-means method; k-means computes means and distances, so it requires numeric input rather than factor variables.
datasetClustering$Platform <- as.numeric(as.character(datasetClustering$Platform))
datasetClustering$Genre <- as.numeric(as.character(datasetClustering$Genre))
View(datasetClustering)
After preprocessing the data, we will now start performing the clustering technique on the processed dataset.
We chose K-means as our clustering method because it handles large datasets well, offering prompt and easily interpretable results. It is useful for exploring data and quickly detecting potential clusters.
This graph depicts the process of finding the optimal number of clusters for a dataset using the Silhouette method. The x-axis represents the number of clusters (k) considered in the analysis, ranging from 1 to 10. The y-axis shows the average Silhouette width, which is a measure of how similar an object is to its own cluster compared to other clusters.
library(factoextra)  # needed for fviz_nbclust; loaded here rather than later
fviz_nbclust(datasetClustering, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
The plot shows a peak at k=3, where the average Silhouette score is the highest. This suggests that the data points are, on average, closer to other points in their own cluster and farther from points in other clusters when the data is divided into three clusters. As a result, according to the Silhouette method, k=3 is the optimal number of clusters.
The Elbow Method using Within-Cluster Sum of Squares (WSS) is a technique to determine the optimal number of clusters in K-means clustering. It involves running the clustering algorithm for a range of cluster numbers and calculating the WSS for each. WSS is the sum of squared distances of each point to its cluster centroid. As the number of clusters increases, WSS tends to decrease; the goal is to find the point where increasing the number of clusters does not lead to a significant decrease in WSS. This point, visually resembling an elbow on a plot of WSS against the number of clusters, is considered the optimal number of clusters.
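The WSS curve described above can be computed directly: run `kmeans` for a range of k and record `tot.withinss` for each. A self-contained sketch on toy data with two well-separated groups (not our dataset):

```r
# Sketch of the computation behind the Elbow method: the WSS for each
# candidate k, which drops sharply up to the true cluster count.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
wss <- sapply(1:6, function(k)
  kmeans(toy, centers = k, nstart = 10)$tot.withinss)
wss
```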
fviz_nbclust(datasetClustering, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
As shown in the graph above, k = 4 is the value that resembles an elbow in the plot (the turning point), which means it is the optimal value of k that we will use in our clustering process.
In conclusion, we will choose K=4 for our clustering process, as it marks the turning point on the Elbow Method curve, indicating an optimal balance in cluster compactness and separation. Additionally, we will utilize K=3 and K=6, as these values maximize the average silhouette width, with K=3 being the primary maximizer and K=6 the secondary. By selecting these specific K values, we aim to achieve a satisfactory level of precision and recall in our clustering analysis, ensuring both the relevance and completeness of the clustered data.
set.seed(5000)
kmeans.result <- kmeans(datasetClustering, 3)
# print the clustering result
kmeans.result
K-means clustering with 3 clusters of sizes 153, 316, 153
Cluster means:
Platform Genre NA_Sales EU_Sales Other_Sales
1 5.830065 5.562092 0.08035754 0.10050542 0.02601394
2 17.003165 5.471519 0.06264909 0.09720939 0.04795037
3 27.967320 6.509804 0.10228488 0.12072220 0.04096561
Clustering vector:
[1] 2 1 1 1 1 3 3 2 3 1 3 3 3 2 2 2 2 1 2 1 3 2 1 2 2 2 2 1 2 1 2 2 2 3 2 3 1 2 3 1 3 1 1 3 2 3 2 2 2 2 3 1 3 2 2 2 3 1 3 1 1 1 2 3 1 2 2 3 2 1 3 2
[73] 2 2 1 2 1 2 2 1 2 2 3 2 2 1 2 3 1 3 1 2 3 3 1 2 3 2 2 2 2 3 2 3 3 1 3 3 2 1 1 2 3 2 2 2 2 2 3 2 3 2 1 2 2 1 2 1 1 1 3 1 2 1 1 3 2 1 2 2 2 1 1 2
[145] 2 2 2 1 1 3 3 2 2 3 2 2 2 1 2 1 1 3 2 1 3 1 2 1 2 2 2 1 3 2 2 1 2 2 2 3 2 1 1 2 2 1 2 2 1 2 3 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 3 2 1 1 2 2 2
[217] 1 3 3 1 3 3 2 1 2 2 1 2 2 1 2 1 3 3 1 1 3 2 1 2 2 2 2 2 2 3 1 2 3 2 2 2 1 1 2 3 3 2 1 2 3 2 3 2 3 2 1 3 2 3 3 2 2 3 1 2 2 2 2 2 1 2 2 2 3 2 3 2
[289] 3 3 3 3 2 3 1 2 1 2 2 3 2 3 3 1 2 2 3 2 2 2 1 1 3 2 2 2 2 2 2 2 1 2 3 2 3 2 3 2 2 2 2 1 2 2 3 2 3 1 3 2 3 2 3 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 2
[361] 2 2 3 2 1 3 1 2 2 2 2 2 2 1 2 2 3 1 2 2 1 2 3 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 1 2 1 2 2 2 3 2 3 3 2 2 2 2 1 2 2 3 1 1 3 3 2 3 2 3 2 2 3 2 2 2
[433] 2 2 2 2 1 1 2 2 1 2 2 2 2 2 1 2 2 2 3 1 1 1 3 3 1 2 2 2 1 3 2 2 1 1 2 2 2 1 1 1 1 2 3 2 3 2 2 2 1 2 2 3 1 2 2 2 2 2 3 3 2 2 3 1 3 3 1 2 1 3 2 2
[505] 2 2 2 1 2 2 2 3 3 2 2 2 3 1 3 1 2 2 2 2 3 3 3 3 3 1 2 2 2 2 3 2 1 1 2 2 3 1 2 2 1 3 3 3 3 1 1 3 3 3 2 3 2 2 3 1 1 3 3 3 1 1 3 2 2 2 1 2 3 1 1 3
[577] 2 1 1 2 3 1 1 2 3 2 2 3 3 2 2 1 1 3 2 1 1 1 2 3 2 1 1 1 3 2 2 3 3 3 1 1 1 2 3 3 1 1 1 3 3 2
Within cluster sum of squares by cluster:
[1] 2398.331 5091.704 2840.916
(between_SS / total_SS = 78.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = datasetClustering)
#average silhouette for cluster k=3
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(datasetClustering))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 10330.95
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Function to calculate BCubed precision and recall
bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0
  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]
    # Number of items with the same label within the same cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)
    # Total number of items in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)
    # Total number of items with the same label
    total_same_category <- sum(data$label == label)
    # Add this item's precision and recall to the running sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }
  # Average precision and recall over all items
  precision <- total_precision / n
  recall <- total_recall / n
  return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.004823151
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 1
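The BCubed definitions used above can be hand-checked on a tiny example, computed here from scratch (toy clusters and labels, independent of our dataset):

```r
# Hand-worked BCubed example on four items: clusters {1,1,2,2} with
# true labels {a,a,a,b}.
cl  <- c(1, 1, 2, 2)
lab <- c("a", "a", "a", "b")
per_item_prec <- sapply(seq_along(cl), function(i)
  sum(lab[cl == cl[i]] == lab[i]) / sum(cl == cl[i]))
per_item_rec <- sapply(seq_along(cl), function(i)
  sum(lab[cl == cl[i]] == lab[i]) / sum(lab == lab[i]))
mean(per_item_prec)  # (1 + 1 + 1/2 + 1/2) / 4 = 0.75
mean(per_item_rec)   # (2/3 + 2/3 + 1/3 + 1) / 4 = 2/3
```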
As the graph for K=3 illustrates, there is noticeable overlap between the clusters, which hurts clustering performance. The total within-cluster sum of squares is 10330.95, with between_SS / total_SS = 78.5%. The BCubed recall is 1.00, while the precision of about 0.005 is very low, which may be due to the overlap and the presence of outliers. The average silhouette width of 0.55 is relatively good for the clustering process. Overall, the plot suggests that dividing the data into 3 clusters is reasonable, because the average silhouette score is relatively high.
set.seed(5000)
kmeans.result <- kmeans(datasetClustering, 4)
# print the clustering result
kmeans.result
K-means clustering with 4 clusters of sizes 141, 178, 153, 150
Cluster means:
Platform Genre NA_Sales EU_Sales Other_Sales
1 16.936170 1.879433 0.07084150 0.09965971 0.04933003
2 16.955056 8.387640 0.05618750 0.09439772 0.04611312
3 27.967320 6.509804 0.10228488 0.12072220 0.04096561
4 5.726667 5.480000 0.08067859 0.10160455 0.02645853
Clustering vector:
[1] 1 4 4 4 4 3 3 2 3 4 3 3 3 2 2 1 1 4 1 4 3 2 4 2 2 1 2 4 2 4 2 2 1 3 1 3 4 2 3 4 3 4 4 3 2 3 2 2 2 2 3 4 3 2 2 2 3 4 3 4 4 4 2 3 4 2 1 3 2 4 3 2
[73] 2 1 4 1 4 2 2 4 2 2 3 1 1 4 2 3 4 3 4 2 3 3 4 2 3 1 2 2 1 3 2 3 3 4 3 3 1 4 4 2 3 1 2 2 2 1 3 2 3 1 4 1 1 4 1 4 4 4 3 4 2 4 4 3 2 4 2 2 2 4 4 2
[145] 1 1 1 4 4 3 3 1 2 3 1 1 2 4 1 4 4 3 1 4 3 4 2 4 1 2 2 4 3 2 2 4 1 2 1 3 2 4 2 2 1 4 2 1 4 2 3 2 2 2 2 2 2 1 4 1 4 1 2 1 2 2 2 2 1 3 1 4 4 2 2 2
[217] 2 3 3 4 3 3 1 4 1 2 4 2 1 4 2 4 3 3 4 4 3 1 4 2 2 2 2 2 2 3 4 2 3 2 1 2 4 4 2 3 3 2 4 1 3 2 3 1 3 2 4 3 1 3 3 1 1 3 4 2 1 1 1 2 4 1 1 1 3 2 3 2
[289] 3 3 3 3 2 3 4 2 4 1 1 3 1 3 3 4 2 1 3 1 1 2 4 4 3 1 1 1 2 2 1 1 4 1 3 2 3 1 3 1 1 2 2 4 1 2 3 2 3 4 3 1 3 1 3 1 1 1 2 4 2 1 2 2 4 2 4 2 4 1 1 2
[361] 2 1 3 1 4 3 4 2 2 1 2 1 2 4 1 2 3 4 2 2 4 1 3 1 3 3 3 2 2 1 1 1 1 2 2 1 1 1 1 3 4 2 4 2 2 1 3 2 3 3 1 1 1 2 4 2 1 3 4 4 3 3 1 3 1 3 2 1 3 2 2 2
[433] 1 1 1 2 4 4 2 1 4 2 1 2 2 1 4 1 1 2 3 4 4 4 3 3 4 2 2 1 4 3 2 2 4 4 2 2 2 2 4 4 4 2 3 1 3 1 1 2 4 1 1 3 4 1 2 1 1 1 3 3 1 2 3 4 3 3 4 2 4 3 1 1
[505] 2 2 1 4 2 2 1 3 3 1 2 2 3 4 3 4 1 1 2 2 3 3 3 3 3 4 1 1 2 2 3 1 4 4 2 2 3 4 2 1 4 3 3 3 3 4 4 3 3 3 2 3 2 2 3 4 4 3 3 3 4 4 3 1 2 2 4 2 3 4 4 3
[577] 2 4 4 1 3 4 4 2 3 1 2 3 3 2 2 4 4 3 2 4 4 4 1 3 1 4 4 4 3 1 1 3 3 3 4 4 4 2 3 3 4 4 4 3 3 1
Within cluster sum of squares by cluster:
[1] 729.254 1191.980 2840.916 2262.303
(between_SS / total_SS = 85.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = datasetClustering)
#average silhouette for cluster k=4
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(datasetClustering))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 7024.453
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Reuse the bcubed() function defined above
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.006430868
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 1
As the graph for K=4 illustrates, there is still noticeable overlap between the clusters. The total within-cluster sum of squares drops to 7024.45, with between_SS / total_SS = 85.3%. The BCubed recall remains 1.00, and the precision of about 0.006 is slightly higher than for K=3 but still very low. The average silhouette width of 0.54 is relatively good but slightly below the K=3 value, so by the silhouette criterion K=3 remains the better choice.
set.seed(5000)
kmeans.result <- kmeans(datasetClustering, 6)
# print the clustering result
kmeans.result
K-means clustering with 6 clusters of sizes 41, 155, 97, 138, 135, 56
Cluster means:
Platform Genre NA_Sales EU_Sales Other_Sales
1 12.097561 7.634146 0.11401684 0.08847144 0.01571405
2 17.516129 8.341935 0.05254249 0.09515127 0.05041658
3 28.123711 9.092784 0.11251223 0.13506916 0.04713788
4 5.347826 5.326087 0.07978490 0.10445922 0.02807371
5 17.155556 1.770370 0.06120536 0.09855072 0.05036617
6 27.696429 2.035714 0.08456966 0.09587123 0.03027436
Clustering vector:
[1] 5 4 4 4 4 3 3 1 6 4 3 3 6 2 2 5 5 1 5 4 3 2 4 2 2 5 2 4 2 4 2 1 5 6 5 3 4 2 6 4 6 4 4 6 2 3 2 2 2 2 3 4 3 2 2 2 3 4 6 4 4 4 2 3 4 2 5 6 2 4 3 2
[73] 1 5 4 5 4 2 2 4 2 2 3 5 5 4 2 3 4 6 4 2 3 6 4 2 6 5 2 2 5 6 2 6 3 4 6 6 5 4 1 2 3 5 1 2 1 5 6 2 6 5 4 5 5 4 5 4 4 4 3 4 2 4 4 3 2 1 2 2 2 4 4 2
[145] 5 5 5 1 4 6 3 5 2 6 5 1 2 4 5 4 4 3 5 4 3 4 2 4 5 2 2 4 6 2 2 4 5 2 5 3 2 4 1 2 5 4 2 5 4 2 3 1 2 2 2 1 2 5 4 5 4 5 2 5 2 2 2 1 5 6 5 4 4 1 2 1
[217] 1 3 6 4 3 6 5 4 5 2 4 2 5 4 2 4 3 3 4 4 3 5 4 1 2 2 2 2 2 3 1 2 6 2 5 1 4 4 2 3 6 2 4 5 3 2 6 5 3 2 4 3 5 3 6 5 5 3 4 2 5 5 5 2 4 5 5 5 3 2 3 2
[289] 3 3 6 3 2 6 4 1 4 5 5 6 5 3 6 4 1 5 3 5 5 2 1 4 6 5 5 5 2 2 5 5 4 5 3 2 3 5 6 5 5 2 2 4 5 2 3 2 6 4 3 5 6 5 3 5 5 5 2 4 2 5 2 2 1 2 4 2 4 5 5 2
[361] 1 5 6 5 4 3 4 2 2 5 2 5 2 4 5 2 3 4 2 2 1 5 3 5 3 3 6 2 1 5 5 5 5 2 2 5 5 5 5 3 4 2 4 2 2 5 3 2 3 3 5 5 5 2 4 2 5 3 4 4 6 3 5 3 5 3 2 5 6 2 2 2
[433] 5 5 1 2 1 4 2 5 4 2 5 2 2 5 4 5 5 1 3 4 4 4 6 6 4 2 2 5 4 6 2 2 4 4 2 2 2 1 4 4 4 2 3 5 6 5 1 2 4 5 5 6 4 5 2 5 5 5 3 3 5 1 6 4 6 3 4 2 4 3 5 5
[505] 2 2 5 4 2 2 5 3 3 5 2 2 3 4 3 4 5 5 2 2 3 3 6 3 3 4 1 5 2 2 6 5 4 4 2 2 3 1 1 5 4 3 3 3 3 4 4 3 6 3 2 3 2 2 3 4 1 3 3 3 4 4 6 5 2 2 4 2 6 4 4 6
[577] 2 1 4 5 3 4 4 2 6 5 2 3 3 2 2 4 4 3 2 4 4 4 5 6 1 4 4 4 6 5 5 3 3 3 4 4 4 1 6 3 4 4 4 3 3 1
Within cluster sum of squares by cluster:
[1] 361.4029 726.4409 697.2027 1868.5574 536.4539 368.9975
(between_SS / total_SS = 90.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = datasetClustering)
#average silhouette for cluster k=6
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(datasetClustering))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 4559.055
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
# Reuse the bcubed() function defined above
# Calculate BCubed precision and recall
metrics <- bcubed(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.009646302
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 1
| Measure | K=3 | K=4 | K=6 |
|---|---|---|---|
| Average Silhouette width | 0.55 | 0.54 | 0.52 |
| Total within-cluster sum of squares | 10330.95 | 7024.453 | 4559.055 |
| BCubed precision | 0.00482315 | 0.00643087 | 0.00964630 |
| BCubed recall | 1.00 | 1.00 | 1.00 |
As the graph for K=6 illustrates, the overlap between clusters persists. The total within-cluster sum of squares falls to 4559.06, with between_SS / total_SS = 90.5%, as expected when the number of clusters grows. The BCubed recall is again 1.00, and the precision of about 0.010 is the highest of the three runs but still very low. However, the average silhouette width of 0.52 is the lowest of the three values. Overall, the results suggest that dividing the data into 3 clusters is the most appropriate choice, because it yields the highest average silhouette score.